ASSETS-8997 add serviceoverload error reason #71

Open
wants to merge 3 commits into
base: master

@adamcin adamcin commented May 24, 2022

jira: https://jira.corp.adobe.com/browse/ASSETS-8997
downstream dependent @adobe/asset-compute-sdk PR#182

This change adds a new rendition_failed reason (ServiceOverLoad) for use by asset compute workers that encounter upstream API rate limiting and need to indicate to downstream clients that a resubmission of the original asset compute request is necessary after some time has passed.

Also defined is a ServiceOverLoadErrorType, which extends ClientError rather than GenericError, because it is defined in the spirit of HTTP 4xx (429, specifically).

@@ -121,6 +122,14 @@ class RenditionTooLarge extends ClientError {
}
}

// Worker encountered upstream API rate limiting. Client may resubmit request after some time.
class ServiceOverLoadError extends ClientError {
Member

We have a similar error in api-process already (look for TooManyRequestsError). Could we merge those two classes into one instead, and refactor a bit where possible?

Author

I based the ServiceOverLoad reason name on the design documents attached to ASSETS-8997. In some places those documents list the error as "TooManyRequests/ServiceOverLoad", and it wasn't clear whether that reflected ambiguity over which name to use or an intention to have both. I personally see value in supporting both errors. TooManyRequests is more readily associated with an HTTP 429 response originating from the Asset Compute Service itself, which can provide a Retry-After directive for the client. ServiceOverLoad would represent a more general error type that Asset Compute can throw asynchronously when it encounters throttling from upstream/3rd-party services (such as when a worker receives a 429 Too Many Requests HTTP response).

If the AEM client receives either error, the proper behavior is to retry the original request after some time has passed. With TooManyRequests the client may be given an explicit Retry-After, whereas with ServiceOverLoad it's basically

Retry-After: 🤷

I kind of had a hybrid approach in mind where we could support both of these error types in AEM for rendition_failed events just in case, and define both types in asset-compute-commons, along with making the semantic distinction more clearly defined along the lines I described above. Would that work?
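A minimal sketch of the hybrid approach described above, for illustration only. The `ClientError` stand-in, the `retryAfterSeconds` field, the `retryDelaySeconds` helper, and the 60-second fallback are all assumptions made for this example; they are not taken from asset-compute-commons or from the AEM client.

```javascript
// Hypothetical stand-in for the ClientError base class in asset-compute-commons.
class ClientError extends Error {
    constructor(message, name, reason) {
        super(message);
        this.name = name;
        this.reason = reason;
    }
}

// Throttling by the Asset Compute Service itself; may carry an explicit delay
// (assumed field name: retryAfterSeconds).
class TooManyRequestsError extends ClientError {
    constructor(message, retryAfterSeconds) {
        super(message, "TooManyRequestsError", "TooManyRequests");
        this.retryAfterSeconds = retryAfterSeconds;
    }
}

// Throttling encountered by a worker from an upstream service; no delay is known.
class ServiceOverLoadError extends ClientError {
    constructor(message) {
        super(message, "ServiceOverLoadError", "ServiceOverLoad");
    }
}

// Assumed client-side fallback when no Retry-After is available.
const DEFAULT_RETRY_SECONDS = 60;

// Returns the number of seconds to wait before resubmitting,
// or null when the failure reason is not one that calls for a retry.
function retryDelaySeconds(event) {
    const retryable = event.reason === "TooManyRequests" || event.reason === "ServiceOverLoad";
    if (!retryable) {
        return null;
    }
    if (event.reason === "TooManyRequests" && Number.isFinite(event.retryAfterSeconds)) {
        return event.retryAfterSeconds;
    }
    return DEFAULT_RETRY_SECONDS;
}
```

In this sketch a client treats both reasons as retryable and only differs in where the delay comes from: an explicit value for TooManyRequests when one is present, a local default otherwise.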

Member

@tmathern tmathern May 27, 2022

It would work, but I'm not sure whether having two different error names initially was intentional. @pheenomenon can probably clarify whether the two different errors were intended or are just "synonyms" (talking about the current design, not what we'll have eventually).

Contributor

The use of "TooManyRequests/ServiceOverLoad" was not meant to imply that the two are the same. It was only used to express the idea.

I have seen that our downstream services can get overloaded for a variety of reasons and return 500 instead of 429. So I like the idea of keeping it flexible as ServiceOverload instead of TooManyRequestsError.

As to the question of whether TooManyRequestsError (which we use in api-process for Nui throttling) should be converged into ServiceOverload: we can take that route if we want, but it won't have an API dependency with AEM and won't bring a huge advantage. So the hybrid approach sounds good to me too.

Member

In that case (which is also what confused me): although 500 is generic, 503 should map to ServiceOverload (503 usually means the server is busy, but our services don't use it yet as far as I know). Otherwise it could be confusing for developers using our APIs: why would they get an Overload error for a 500, which is generic and could mean anything?
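One possible reading of the status-code mapping under discussion, sketched for illustration. The `reasonForUpstreamStatus` helper is hypothetical, and treating 429 and 503 as overload while leaving 500 generic is an assumption based on this thread, not a decision recorded in it.

```javascript
// Hypothetical helper: picks a rendition_failed reason from the HTTP status
// a worker received from an upstream service.
function reasonForUpstreamStatus(status) {
    // 429 (Too Many Requests) and 503 (Service Unavailable) signal throttling or
    // a busy server, so the client may resubmit after some time (assumed mapping).
    if (status === 429 || status === 503) {
        return "ServiceOverLoad";
    }
    // Anything else, including a plain 500, stays generic because the cause is unknown.
    return "GenericError";
}
```

Keeping 500 out of the overload branch avoids telling API users that a service was overloaded when the actual cause is unknown.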

Member

@tmathern tmathern May 31, 2022

(Could we maybe still move the TooManyRequests exception here too, while at it, @adamcin? If it doesn't throw you off-track?)

@tmathern tmathern self-requested a review May 31, 2022 15:45
Member

@tmathern tmathern left a comment

Approving, with the note that ServiceOverload should be reserved for HTTP code 503, and not generic.

Contributor

@jdelbick jdelbick left a comment

LGTM. Please also update the README with the new error type and its description:
https://github.com/adobe/asset-compute-commons#custom-errors

4 participants